1 Executive Summary

Skeleton ToC so heading numbers match full report

2 Introduction

3 Upscaling methods

Code and results for this section. Full report at:

NB This is a bare bones .Rmd file. Please see the report for full details.

3.1 Data

3.1.1 Green Grid

Load the data - sourced from https://reshare.ukdataservice.ac.uk/853334/

We use the GREEN Grid sample of ~ 40 New Zealand dwellings/households to give worked examples of the scaling methods described.

# half-hourly total household consumption (pre-prepared)
totalF <- paste0(rmdParams$ggPath, "halfHourImputedTotalDemand_Fixed.csv.gz")

# half-hourly lighting consumption
lightingF <- paste0(rmdParams$ggPath, "halfHourLighting.csv.gz")

# half-hourly heat pump consumption
heatPumpF <- paste0(rmdParams$ggPath, "halfHourHeatPump.csv.gz")

# household attribute data
hhF <- paste0(rmdParams$ggPath, "ggHouseholdAttributesSafe_2019-04-09.csv.gz")

# weights created using IPF in SPATIALEC
hhWf <- paste0(rmdParams$ggPath, "ipf/nonZeroWeightsAu2013.csv.gz")

Now prep and check the data:

# > power ----
totalDT <- data.table::fread(totalF)
setkey(totalDT, linkID)
message("N households in total")
## N households in total
uniqueN(totalDT$linkID)
## [1] 40
heatPumpDT <- data.table::fread(heatPumpF)
message("N households with monitored heat pump circuits")
## N households with monitored heat pump circuits
uniqueN(heatPumpDT$linkID)
## [1] 29
setkey(heatPumpDT, linkID)

lightingDT <- data.table::fread(lightingF)
message("N households with monitored lighting circuits")
## N households with monitored lighting circuits
uniqueN(lightingDT$linkID)
## [1] 23
setkey(lightingDT, linkID)

powerDataPrep <- function(dt){
  dt[, meanConsumptionkWh := (meanPowerW/2)/1000]
  dt[, r_as_dateTime := lubridate::as_datetime(r_dateTimeHalfHour)]
  dt[, r_dateTimeNZ := lubridate::with_tz(r_as_dateTime, "Pacific/Auckland")]
  dt[, r_date := lubridate::as_date(r_dateTimeNZ)]
  dt[, r_halfHour := hms::as_hms(r_dateTimeNZ)]
  dt[, r_year := lubridate::year(r_dateTimeNZ)]
  dt <- addNZSeason(dt, dateTime = "r_dateTimeNZ")
  return(dt)
}

totalDT <- powerDataPrep(totalDT)
heatPumpDT <- powerDataPrep(heatPumpDT)
lightingDT <- powerDataPrep(lightingDT)

Now check the data and remove any recommended for exclusion on quality grounds (rf_46)

# check
#table(totalDT$circuit)
t <- totalDT[, .(mean_kWh = mean(meanConsumptionkWh),
                 nHouseholds = uniqueN(linkID),
                 startDate = min(r_date),
                 endDate = max(r_date)), keyby = .(r_year)]

ft <- set_caption(flextable(t),
            caption = "Data availability")
flextable::autofit(ft)
Table 3.1: Data availability

r_year

mean_kWh

nHouseholds

startDate

endDate

2014

0.4482226

21

2014-01-06

2014-12-31

2015

0.4599591

39

2015-01-01

2015-12-31

2016

0.4295491

30

2016-01-01

2016-12-31

2017

0.4519507

22

2017-01-01

2017-12-31

2018

0.4420814

18

2018-01-01

2018-08-01

totalDT <- totalDT[r_year == 2015 & linkID != "rf_46"]
heatPumpDT <- heatPumpDT[r_year == 2015 & linkID != "rf_46"]
lightingDT <- lightingDT[r_year == 2015 & linkID != "rf_46"]

We have data for multiple years. For simplicity we will keep only the data for 2015 when we have the largest number of households (see Table 3.1).

Check that the time of day demand profiles look OK. Figure 3.1 shows overall kWh profiles for all dwellings with electricity use monitoring.

# check - beware which hemisphere we are in?
table(totalDT$month, totalDT$season)
##      
##       Spring Summer Autumn Winter
##   Jan      0  29054      0      0
##   Feb      0  27908      0      0
##   Mar      0      0  35090      0
##   Apr      0      0  51364      0
##   May      0      0  52157      0
##   Jun      0      0      0  47024
##   Jul      0      0      0  46487
##   Aug      0      0      0  45521
##   Sep  43036      0      0      0
##   Oct  43633      0      0      0
##   Nov  39667      0      0      0
##   Dec      0  38840      0      0
testDT <- totalDT[, .(meankWh = mean(meanConsumptionkWh)), 
                  keyby = .(r_halfHour, season, r_year)]

ggplot2::ggplot(testDT, aes(x = r_halfHour, y = meankWh, colour = season)) + 
  geom_line() +
  labs(caption = "Whole household mean kWh per half hour") +
  facet_grid(. ~ r_year)
Overall kWh profile by season

Figure 3.1: Overall kWh profile by season

Figure 3.2 shows lighting kWh profiles for all dwellings with electricity use monitoring.

testDT <- lightingDT[, .(meankWh = mean(meanConsumptionkWh)), 
                  keyby = .(r_halfHour, season, r_year)]

ggplot2::ggplot(testDT, aes(x = r_halfHour, y = meankWh, colour = season)) + 
  geom_line() +
  labs(caption = "Lighting mean kWh per half hour") +
  facet_grid(. ~ r_year)
Lighting profile by season

Figure 3.2: Lighting profile by season

Both Figure @ref(fig:checkProfile_all) and Figure 3.2 show the time of day profiles we would expect.

Process and check household data

# > household ----
hhDT <- data.table::fread(hhF)
hhDT <- code_Q7(hhDT)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
hhDT[, hasPV := `PV Inverter`]
hhDT[, hasPV := ifelse(hasPV == "", "No", hasPV)]
hhDT[, hasSurvey := "Yes"]

hhDT[, nPeople := nAdults + nChildren0_12 + nTeenagers13_18]
hhDT[, nPeople_Factor := ifelse(nPeople > 1, "2", "1")]
hhDT[, nPeople_Factor := ifelse(nPeople > 2, "3", nPeople_Factor)]
hhDT[, nPeople_Factor := ifelse(nPeople > 3, "4+", nPeople_Factor)]

# check
message("Number of households in survey data")
## Number of households in survey data
uniqueN(hhDT$linkID)
## [1] 44
setkey(hhDT, linkID)
setkey(totalDT, linkID)

dt <- totalDT[, .(nObs = .N), keyby = .(linkID)]
dt[, hasElecMonitor := "Yes"]
setkey(dt, linkID)
hhDT <- merge(hhDT, dt, all = TRUE) # keep all the linkIDs
t <- hhDT[, .(n = .N), keyby = .(hasSurvey, hasElecMonitor)]
set_caption(flextable(t),caption = "Number of households with/without survey and monitoring data")
Table 3.2: Number of households with/without survey and monitoring data

hasSurvey

hasElecMonitor

n

Yes

2

Yes

8

Yes

Yes

36

Note that not all households in the survey data have electricity monitoring data and vice versa.

Now check for 0 and negative values in the dwelling overall total kWh.

# any zeros & negative numbers?
#hist(totalDT$meanConsumptionkWh)
ggplot2::ggplot(totalDT, aes(x = meanConsumptionkWh)) +
  geom_histogram() +
  labs(caption = "Mean half-hourly data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# some

# check if aggregated to daily sum
dailyAllDT <- totalDT[, .(sumkWh = sum(meanConsumptionkWh), # daily sum
                          meankWh = mean(meanConsumptionkWh) # daily mean
                          ), 
                      keyby = .(r_date, linkID, season)]

ggplot2::ggplot(dailyAllDT, aes(x = sumkWh)) +
  geom_histogram() +
  labs(caption = "Daily sum of half-hourly data")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# still some neg - probably due to PV?

# check if aggregated to household 
dt <- dailyAllDT[, .(meankWh = mean(sumkWh),
                     sdkWh = sd(sumkWh)), keyby = .(linkID)]
ggplot2::ggplot(dt, aes(x = meankWh)) +
  geom_histogram() +
  labs(caption = "Mean of all half-hourly data per household")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# still some

There are negative kWh values at all levels of aggregation.

Check if the negative values are to do with PV.

# need to check -ve = mid-day, if not is not PV must just be errors?
totalDT[, testValues := "> 0"]
totalDT[, testValues := ifelse(meanConsumptionkWh == 0, "0", testValues)]
totalDT[, testValues := ifelse(meanConsumptionkWh < 0, "< 0", testValues)]
testDT <- totalDT[hhDT[, .(linkID, hasPV)]]
dt <- testDT[testValues == "< 0", .(nValues = .N),
                  keyby = .(r_halfHour, season, linkID, testValues,hasPV)]

p <- ggplot2::ggplot(dt[!is.na(testValues) & !is.na(season)], aes(x = r_halfHour, y = nValues, colour = linkID)) +
  geom_line() +
  facet_grid(season ~ testValues + hasPV) +
  labs(x = "Half hour",
       y = "N",
       caption = paste0("All -ve values, Yes = PV reported in survey\n",
                        "Colours = individual dwellings, legend omitted for clarity")
  ) +
  theme(legend.position="none")
p
Count of negative kWh values by time of day and whether reported presence of PV

Figure 3.3: Count of negative kWh values by time of day and whether reported presence of PV

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:flextable':
## 
##     style
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plotly::ggplotly(p)

Figure 3.3: Count of negative kWh values by time of day and whether reported presence of PV

so: a) neg values indicate PV - although at least one household who reported having PV (“Yes”) apparently did not (rf_23) b) low levels of negative values where PV is not reported (’No") may indicate monitoring device errors (rf_14 and rf_26)

If we only care about network load we could keep all values > 0 only. If we want total energy input then we need grid draw + PV input which we might be able to calculate from the PV circuit W - grid export but it gets complicated.

So for now we’ll leave out the dwellings which seem to have PV

setkey(dailyAllDT, linkID)
dailyDT <- dailyAllDT[hhDT[hasPV == "No"]]
ggIPF_AU_DT <- data.table::fread(hhWf)
skimr::skim(ggIPF_AU_DT)
Table 3.3: Data summary
Name ggIPF_AU_DT
Number of rows 78078
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
AU2013_label 0 1 4 33 0 1859 0
REGC2013_label 0 1 12 24 0 17 0
linkID 0 1 5 6 0 42 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AU2013_code 0 1 554468.19 33426.10 500100 524711.0 551030.00 584936.00 622201.00 ▇▇▅▇▅
nMBs 0 1 24.75 16.54 1 11.0 22.00 35.00 124.00 ▇▅▁▁▁
ipfWeight 0 1 19.84 37.87 0 3.3 9.88 20.64 1973.35 ▇▁▁▁▁

3.1.2 MBIE Electricity statistics

Source: https://www.mbie.govt.nz/building-and-energy/energy-and-natural-resources/energy-statistics-and-modelling/energy-statistics/electricity-statistics/

mbieDF <- readxl::read_xlsx(paste0(rmdParams$mbiePath, "/MBIE_Electricity_extract_2015.xlsx"))
## New names:
## * `` -> ...1
mbieDT <- as.data.table(mbieDF)
setnames(mbieDT, "...1", "Variable")
set_caption(colformat_num(flextable(mbieDT), digits = 2),
            caption = "MBIE data extract for 2015")
Table 3.4: MBIE data extract for 2015

Variable

Q1 2015

Q2 2015

Q3 2015

Q4 2015

Net Generation (GWh)1,2

10,160.94

10,760.95

11,519.74

10,598.11

Hydro

5,305.53

6,102.22

6,809.89

6,067.01

Geothermal

1,851.91

1,927.84

1,882.43

1,892.51

Biogas

57.36

61.69

62.45

62.48

Wood

86.70

88.60

81.73

91.71

Wind

459.17

619.16

587.16

675.01

Solar3

8.81

6.11

7.67

13.55

Oil

0.31

0.73

0.37

0.04

Coal

668.59

320.36

301.76

462.30

Gas

1,710.77

1,621.84

1,773.87

1,321.09

Waste Heat4

11.80

12.40

12.40

12.40

Total Consumption(GWh)5

9,513.73

10,035.25

11,024.87

10,025.26

Residential consumption (GWh)

2,427.96

3,141.21

4,103.25

2,900.94

Residential Number of ICPs5

1,702,555.93

1,717,234.06

1,725,995.65

1,736,972.82

Residential Average consumption per ICP (kWh)5

1,426.07

1,829.23

2,377.33

1,670.11

Mean daily kWh per ICP

15.85

20.32

26.41

18.56

resCons <- mbieDT[Variable %like% "Residential"]

dt <- melt(resCons)
## Warning in melt.data.table(resCons): id.vars and measure.vars are internally
## guessed when both are 'NULL'. All non-numeric/integer/logical type columns are
## considered id.vars, which in this case are columns [Variable, ...]. Consider
## providing at least one of 'id' or 'measure' vars in future.
annualResGWh <- sum(dt[Variable == "Residential  consumption (GWh)"]$value)
rmdParams$NZ_nResICPs_MBIE_2015 <- mean(dt[Variable == "Residential Number of ICPs5"]$value)

dailyResMeanTotalGWh <- annualResGWh/365
dailyResMeanKWh <- 1000000*(dailyResMeanTotalGWh/rmdParams$NZ_nResICPs_MBIE_2015)

# NB: Mean daily kWh per ICP = average q total/90 (days)

We use this data below to validate the upscaled results. The quarterly residential consumption figures are used directly and we also calculate:

  • mean annual daily residential GWh: 34.45 GWh.
  • mean daily kWh per ICP (household): 20.02 kWh

3.1.3 Census 2013

This data comprises family, household or dwelling counts (depending on the variables of interest) at area unit level. The data was downloaded from the Stats NZ census table download tool and processed using code developed as part of the EU H2020 Marie Skłodowska-Curie scheme-funded SPATIALEC Global Fellowship hosted by the University of Otago.

f <- paste0(rmdParams$censusPath, "nResidents/TABLECODE8169_Data_f4c79560-8c41-4ff9-b861-5598e6e0b50d.csv")
censusDT <- data.table::fread(f)

censusOccupancyDT <- censusDT[Area == "Total, New Zealand by regional council/area unit" &
                `Number of usual residents in household` %like% "usual"]

ft <- set_caption(flextable(censusOccupancyDT[Year == "2013", .(`Number of usual residents in household`, n = Value)]), caption = "Number of households by occupancy (2013)")
autofit(ft)
Table 3.5: Number of households by occupancy (2013)

Number of usual residents in household

n

One usual resident

355284

Two usual residents

527676

Three usual residents

254862

Four usual residents

235332

Five usual residents

106329

Six usual residents

41184

Seven usual residents

15402

Eight or more usual residents

13824

censusOccupancyDT[, nPeople := `Number of usual residents in household`]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "One", "1", "4+")]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "Two", "2", nPeople_Factor)]
censusOccupancyDT[, nPeople_Factor := ifelse(nPeople %like% "Three", "3", nPeople_Factor)]
#censusOccupancyDT[, .(n = sum(Value)), keyby = .(nPeople_Factor,nPeople)]

rmdParams$NZ_nHHs_Census_2013 <- sum(censusOccupancyDT[Year == "2013"]$Value)

Table 3.5 shows the number of households by household size. We will use this later.

# might not need this
f <- paste0(rmdParams$censusPath, "2013/2013IpfInput.csv")

censusIPFInputAUDT <- data.table::fread(f)
skimr::skim(censusIPFInputAUDT)
Table 3.6: Data summary
Name censusIPFInputAUDT
Number of rows 2020
Number of columns 34
_______________________
Column type frequency:
character 2
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
AU2013_label 0 1 4 34 0 2020 0
REGC2013_label 0 1 12 24 0 17 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AU2013_code 0 1.00 558310.19 36014.20 500100 526176.50 555350.0 587302.25 666673.00 ▇▆▇▃▁
nRooms1_4 8 1.00 121.33 182.91 0 24.00 72.0 165.00 4122.00 ▇▁▁▁▁
nRooms5_6 8 1.00 312.30 263.44 0 84.00 261.0 489.00 1758.00 ▇▅▁▁▁
nRooms7m 8 1.00 342.56 285.93 0 108.00 297.0 504.75 1992.00 ▇▃▁▁▁
nrooms_statedtotalHouseholds 8 1.00 731.16 589.34 0 219.00 643.5 1131.00 4968.00 ▇▃▁▁▁
nrooms_totalHouseholds 8 1.00 776.29 623.96 0 233.25 685.5 1197.00 5568.00 ▇▃▁▁▁
calcSum 8 1.00 776.19 624.19 0 231.00 685.5 1194.75 5562.00 ▇▃▁▁▁
calcSumPc 131 0.94 98.50 12.46 0 99.69 100.0 100.28 137.50 ▁▁▁▇▁
nPeople_1 8 1.00 176.58 168.71 0 48.00 132.0 261.00 1797.00 ▇▁▁▁▁
nPeople_2 8 1.00 262.21 217.39 0 84.00 228.0 390.75 2199.00 ▇▂▁▁▁
nPeople_3 8 1.00 126.64 106.58 0 33.00 108.0 198.00 948.00 ▇▂▁▁▁
nPeople_4m 8 1.00 204.70 182.03 0 54.00 168.0 306.00 1167.00 ▇▃▁▁▁
npeople_totalHouseholds 8 1.00 770.32 618.78 0 231.00 681.0 1188.75 5367.00 ▇▃▁▁▁
i.calcSum 8 1.00 770.13 618.81 0 231.00 684.0 1191.00 5364.00 ▇▃▁▁▁
i.calcSumPc 124 0.94 98.39 12.52 0 99.69 100.0 100.24 125.00 ▁▁▁▇▃
nKids_1m 8 1.00 261.57 220.29 0 69.00 222.0 402.00 1335.00 ▇▅▂▁▁
nkids_totalFamilies 8 1.00 564.81 452.99 0 168.00 502.5 861.00 2460.00 ▇▅▂▁▁
nKids_0_families 8 1.00 303.24 247.00 0 96.00 262.5 453.00 1728.00 ▇▅▁▁▁
i.calcSum.2 8 1.00 564.81 452.99 0 168.00 502.5 861.00 2460.00 ▇▅▂▁▁
i.calcSumPc.2 134 0.93 100.00 0.00 100 100.00 100.0 100.00 100.00 ▁▁▇▁▁
fuel_totalHouseholds 8 1.00 776.29 623.96 0 233.25 685.5 1197.00 5568.00 ▇▃▁▁▁
fuel_totalStatedHouseholds 8 1.00 733.37 590.50 0 219.00 646.5 1131.00 4920.00 ▇▃▁▁▁
heatSourceCoal 8 1.00 30.30 62.10 0 6.00 15.0 30.00 930.00 ▇▁▁▁▁
heatSourceElectricity 8 1.00 592.64 498.79 0 156.00 492.0 918.00 3342.00 ▇▅▁▁▁
heatSourceGas 8 1.00 200.97 205.12 0 45.00 138.0 300.00 1371.00 ▇▂▁▁▁
heatSourceOther 8 1.00 76.61 98.01 0 18.00 51.0 105.75 2202.00 ▇▁▁▁▁
heatSourceWood 8 1.00 269.51 245.25 0 96.00 213.0 375.00 1764.00 ▇▂▁▁▁
i.calcSum.1 8 1.00 1170.03 898.84 0 393.00 1071.0 1773.75 5922.00 ▇▅▂▁▁
i.calcSumPc.1 131 0.94 154.81 25.82 0 141.04 153.8 167.98 233.33 ▁▁▂▇▁
nMBs 0 1.00 23.09 16.90 1 9.00 20.0 34.00 124.00 ▇▅▁▁▁
nKids_0 8 1.00 508.75 433.10 0 152.25 430.5 771.00 4929.00 ▇▁▁▁▁
nkids_totalHouseholds 8 1.00 770.32 618.78 0 231.00 681.0 1188.75 5367.00 ▇▃▁▁▁
f <- paste0(rmdParams$censusPath, "2013/censusAu2013IpfInput.csv")

censusIPFInputAU_reducedDT <- data.table::fread(f)
skimr::skim(censusIPFInputAU_reducedDT)
Table 3.6: Data summary
Name censusIPFInputAU_reducedD…
Number of rows 2020
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
AU2013_label 0 1 4 34 0 2020 0
REGC2013_label 0 1 12 24 0 17 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
AU2013_code 0 1.00 558310.19 36014.20 500100 526176.50 555350.0 587302.25 666673 ▇▆▇▃▁
nrooms_totalHouseholds 8 1.00 776.29 623.96 0 233.25 685.5 1197.00 5568 ▇▃▁▁▁
nRooms1_4 163 0.92 5.96 24.65 0 0.00 3.0 6.00 780 ▇▁▁▁▁
nRooms5_6 162 0.92 136.73 110.18 0 45.00 114.0 204.00 768 ▇▃▁▁▁
nRooms7m 162 0.92 135.55 101.55 0 51.00 117.0 198.00 639 ▇▅▂▁▁
nrooms_statedtotalHouseholds 142 0.93 783.33 575.53 6 297.00 702.0 1164.00 4968 ▇▃▁▁▁
count.NA 162 0.92 48.85 46.13 0 15.00 36.0 69.00 600 ▇▁▁▁▁
npeople_totalHouseholds 8 1.00 770.32 618.78 0 231.00 681.0 1188.75 5367 ▇▃▁▁▁
nPeople_1 155 0.92 190.49 167.50 0 60.00 150.0 273.00 1797 ▇▁▁▁▁
nPeople_2 157 0.92 283.19 212.37 0 114.00 249.0 405.00 2199 ▇▂▁▁▁
nPeople_3 161 0.92 137.06 104.23 0 48.00 120.0 204.00 948 ▇▃▁▁▁
nPeople_4m 161 0.92 126.59 100.35 0 42.00 108.0 183.00 744 ▇▃▁▁▁
nkids_totalFamilies 8 1.00 564.81 452.99 0 168.00 502.5 861.00 2460 ▇▅▂▁▁
nKids_1m 197 0.90 182.94 137.34 3 69.00 162.0 264.00 987 ▇▅▁▁▁
nKids_0_families 197 0.90 440.05 307.64 6 177.00 402.0 636.00 2010 ▇▆▂▁▁
fuel_totalHouseholds 8 1.00 776.29 623.96 0 233.25 685.5 1197.00 5568 ▇▃▁▁▁
heatSourceElectricity 148 0.93 624.43 482.04 0 213.00 537.0 930.75 3279 ▇▅▁▁▁
heatSourceGas 163 0.92 95.18 166.51 0 3.00 9.0 114.00 1110 ▇▁▁▁▁
heatSourceWood 156 0.92 290.91 242.28 0 120.00 234.0 393.00 1764 ▇▃▁▁▁
heatSourceCoal 161 0.92 32.79 63.97 0 6.00 15.0 33.00 930 ▇▁▁▁▁
heatSourceOther 163 0.92 12.38 15.10 0 3.00 9.0 15.00 258 ▇▁▁▁▁
fuel_totalStatedHouseholds 142 0.93 785.70 576.58 6 294.75 705.0 1170.00 4920 ▇▃▁▁▁
nMBs 0 1.00 23.09 16.90 1 9.00 20.0 34.00 124 ▇▅▁▁▁
nKids_0 197 0.90 673.04 488.46 12 267.00 603.0 976.50 5271 ▇▂▁▁▁

3.2 Re-weighting

3.2.1 Multiplication of point estimates

In this example we simply multiply point estimates by the population size. In this case we:

  • calculate mean daily kWh for 2015
  • calculate mean daily kWh by quarter (with 95% confidence intervals) for 2015. We use quarter instead of season so that we can validate against MBIE quarterly data (see below);
  • multiply both by the the 2015 NZ Statistics household count projection of 1,796,000 households to give national estimates of daily residential electricity use.

3.2.1.1 Daily

Table 3.7 shows the mean daily kWh and the estimated 95% confidence intervals for this mean calculated from Green Grid sample. The table also shows the same result for a simulated HEEP2 full sample size of 280 to illustrate the effect of increasing sample size (only) on precision c.f (Anderson et al. 2020). In this case we assume that the mean and s.d. are unchanged. As noted in WP1, there is no ‘right’ value for this sample size - it depends what level of (im)precision can be accommodated in the analytic results as will be made clear below.

baseDT <- dailyDT[!is.na(sumkWh), .(mean = mean(sumkWh),
                  sd = sd(sumkWh),
                  nSample = uniqueN(linkID))]
baseDT[, ci_lower := mean - (sd/sqrt(nSample))*qnorm(0.975)]
baseDT[, ci_upper := mean + (sd/sqrt(nSample))*qnorm(0.975)]
baseDT[, source := "GreenGrid"]

baseDTHEEP2 <- copy(baseDT)
# simulate HEEP2 full
# really we should do a proper re-sampling etc but...
baseDTHEEP2[, ci_lower := mean - (sd/sqrt(280))*qnorm(0.975)]
baseDTHEEP2[, ci_upper := mean + (sd/sqrt(280))*qnorm(0.975)]
baseDTHEEP2[, source := "Simulated HEEP2 sample size"]
baseDTHEEP2[, nSample := 280]

dt <- rbind(baseDT,baseDTHEEP2)
dt[, ci_range := ci_upper - ci_lower]

ft <- colformat_num(flextable(dt[, .(source, nSample, mean, sd,
                                ci_lower, ci_upper, ci_range)]), digits = 2, na_str = "N/A")
ft <- colformat_num(ft, j = "nSample", digits = 0, na_str = "N/A")
set_caption(autofit(ft), 
            caption = "Mean daily household kWh (2015)")
Table 3.7: Mean daily household kWh (2015)

source

nSample

mean

sd

ci_lower

ci_upper

ci_range

GreenGrid

33

22.76

12.66

18.44

27.08

8.64

Simulated HEEP2 sample size

280

22.76

12.66

21.27

24.24

2.97

# for text
pcOverEstimate <- 100*((baseDT$mean/dailyResMeanKWh)-1) # % mean higher than MBIE mean
ci_range <- baseDT$ci_upper - baseDT$ci_lower # CI range
pcOfMean <- ci_range/baseDT$mean

As we can see the Green grid data produces a mean daily kWh value which is ~14% higher than the MBIE derived figure of 20.02. However we can also see that the MBIE value is included within the very large Green Grid 95% CI which range from 18.44 to 27.08 or 9 kWh/day (~ 38 % of the mean)!

On the other hand, while the HEEP2 simulation shows the consequences of increasing sample size on precision (much narrower 95% confidence interval), the MBIE value now lies outside (below) the 95% lower confidence interval limit. We would therefore suspect that the Green grid sample may have higher than average consumption.

Table 3.8 shows the result of scaling this daily mean by the estimated number of New Zealand households (1,721,000). The first row uses the original Green Grid mean and confidence intervals while the second row uses the simulated HEEP2 confidence intervals.

dt[, GWh := (mean * rmdParams$NZ_HHs)/1000000]
dt[, GWh_lower := (ci_lower * rmdParams$NZ_HHs)/1000000]
dt[, GWh_upper := (ci_upper * rmdParams$NZ_HHs)/1000000]
dt[, nNZHHs := rmdParams$NZ_HHs]
dt <- dt[, GWh_CI := GWh_upper - GWh_lower]

ft <- colformat_num(flextable(dt[, .(source, nSample,nNZHHs,
                                GWh, GWh_lower, GWh_upper, 
                                GWh_CI)]), digits = 2, na_str = "N/A")
ft <- colformat_num(autofit(ft), j = c("nSample", "nNZHHs"),
                    digits = 0, na_str = "N/A")

set_caption(ft, 
            caption = "Mean total daily household GWh (scaled, 2015)")

Table 3.8: Mean total daily household GWh (scaled, 2015)

source

nSample

nNZHHs

GWh

GWh_lower

GWh_upper

GWh_CI

GreenGrid

33

1,721,000

39.16

31.73

46.60

14.87

Simulated HEEP2 sample size

280

1,721,000

39.16

36.61

41.72

5.10

In both cases the mean daily residential GWh (39.16) is considerably higher than the MBIE derived total of 34.45 GWh/day (see Section 3.1.2) reconfirming our suspicion that the Green grid sample is biased towards higher usage households. This is likely to be due to the preferential recruitment of larger households with electricity as a main energy source and heating via heat pumps for the Green Grid study (Anderson et al. 2018).

As we would expect (second row in Table 3.8), increasing the sample size to 280 increases precision. Nevertheless, the 95% CI of ~ 5 GWh/day is roughly equivalent to the New Zealand daily wind generation of 6.5 GWh in Q3 2015 (c.f. Table 3.4), a level of imprecision that may still preclude use of the estimate for practical purposes. As noted above and as discussed in WP1 report at length, the level of precision required should dictate the sample size required.

3.2.1.2 Daily by quarter

To extend the example, we next estimate the same values but by quarter to compare with the MBIE data. Table 3.9 shows the mean daily kWh per household by season. NA values indicate the households for whom no electricity monitoring data exists. The table also shows the lower and upper 95% confidence intervals calculated using the sample sd and n.

For validation purposes we use the quarterly per ICP kWh consumption values for 2015 (see Table 3.4):

  • Q1 2015: 1,426
  • Q2 2015: 1,829
  • Q3 2015: 2,377
  • Q4 2015: 1,670

These have been converted to mean daily values for comparison purposes as shown in Table 3.9 and Figure 3.4.

Figure 3.4 shows the Green grid estimates (columns) together with their 95% confidence intervals (error bars) and the quarterly MBIE values (dots). We can see that the MBIE derived values are a close match to the Green Grid point estimates for Q1 and Q4 and lie in the centre of the Green Grid 95% confidence intervals. However the Green Grid data appears to over-estimate consumption in Q2 and Q3 which correspond to the New Zealand heating season. This is particularly noticeable for Q2 where the MBIE value is only just within the Green Grid 95% confidence interval. Once again, this reflects the Green grid sample bias towards family households with higher than average electricity demand in winter.

dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-04-01"), "Q2", "Q1")]
dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-07-01"), "Q3", q)]
dailyDT[, q := ifelse(r_date >= lubridate::as_date("2015-10-01"), "Q4", q)]
dailyDT[, r_month := lubridate::month(r_date)]
#table(dailyDT$q, dailyDT$r_month)

qDT <- dailyDT[, .(mean = mean(sumkWh),
                  sd = sd(sumkWh),
                  n = uniqueN(linkID)), keyby = q]

qDT[, ci_lower := mean - (sd/sqrt(n))*qnorm(0.975)]
qDT[, ci_upper := mean + (sd/sqrt(n))*qnorm(0.975)]

# a quarter is ~ 90 days but...
qDT[q == "Q1", mbie_kWh := 1426/as.double(difftime("2015-04-01","2015-01-01"))]
qDT[q == "Q2", mbie_kWh := 1829/as.double(difftime("2015-07-01","2015-04-01"))]
qDT[q == "Q3", mbie_kWh := 2377/as.double(difftime("2015-10-01","2015-07-01"))]
qDT[q == "Q4", mbie_kWh := 1670/as.double(difftime("2015-12-31","2015-10-01"))]

ft <- colformat_num(flextable(qDT), digits = 2, na_str = "N/A")
ft <- colformat_num(ft, j = "n", digits = 0, na_str = "N/A")
set_caption(ft, 
            caption = "Mean daily household kWh by quarter (2015)")
Table 3.9: Mean daily household kWh by quarter (2015)

q

mean

sd

n

ci_lower

ci_upper

mbie_kWh

N/A

N/A

N/A

7

N/A

N/A

N/A

Q1

16.56

7.85

32

13.84

19.28

15.85

Q2

24.40

12.87

32

19.94

28.86

20.10

Q3

28.36

14.44

29

23.11

33.62

25.84

Q4

18.93

9.50

28

15.41

22.45

18.34

ggplot2::ggplot(qDT, aes(x = q, y = mean, fill = q)) +
  geom_col() +
  scale_fill_discrete(name = "Quarter") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), colour = "grey50") +
  geom_point(aes(y = mbie_kWh))
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (geom_point).
Quarterly mean daily kWh

Figure 3.4: Quarterly mean daily kWh

Table 3.10 shows the results of multiplying each of the point estimates (the mean) and the upper and lower confidence intervals by the population size (1,796,000) to estimate the total daily electricity consumption for all New Zealand households by quarter. As would be expected given Table 3.9, the uncertainty in the estimates is very large with the 95% confidence interval of ~19 GWh in Q3 roughly equivalent to the daily output of all New Zealand geothermal generation in Q3 2015 (~ 20 GWh/day - see MBIE data Table 1) or about 16% of total net generation. This level of uncertainty would mean that generalisation for policy purposes would be far from robust.

qDT[, scaledSum := mean * rmdParams$NZ_HHs]
qDT[, scaledCI_lo := ci_lower * rmdParams$NZ_HHs]
qDT[, scaleduCI_hi := ci_upper * rmdParams$NZ_HHs]

t <- qDT[, .(q, n, GWh = scaledSum/1000000, 
                        GWh_lower = scaledCI_lo/1000000, 
                        GWh_upper = scaleduCI_hi/1000000)]
t <- t[, CI := GWh_upper - GWh_lower]

ft <- flextable(t[!is.na(q), .(q, sampleN = n, popN = rmdParams$NZ_HHs, GWh, GWh_lower, GWh_upper, CI)])

ft <- colformat_num(ft, digits = 2)
ft <- colformat_num(ft, j = c("sampleN", "popN"), digits = 0, na_str = "N/A")

set_caption(ft, 
            caption = "Mean total daily household GWh by quarter (scaled, 2015)")
Table 3.10: Mean total daily household GWh by quarter (scaled, 2015)

q

sampleN

popN

GWh

GWh_lower

GWh_upper

CI

Q1

32

1,721,000

28.50

23.82

33.18

9.36

Q2

32

1,721,000

42.00

34.32

49.67

15.35

Q3

29

1,721,000

48.81

39.76

57.86

18.09

Q4

28

1,721,000

32.58

26.52

38.63

12.12

As before, we have re-calculated the 95% confidence intervals using n = 280 to simulate the potential precision for the HEEP2 full sample. Table 3.11 shows the results of doing so - the uncertainty in Q4 is now reduced to a 95% confidence interval of ~ 6 GWh, roughly equivalent to the daily wind generation of 7.5 GWh in Q4 2015. While the precision has improved, there is still substantial uncertainty illustrating the difficulty of making national level estimates from relatively small samples.

qDT[, ci_lower_HEEP2 := mean - (sd/sqrt(280))*qnorm(0.975)]
qDT[, ci_upper_HEEP2 := mean + (sd/sqrt(280))*qnorm(0.975)]

qDT[, scaledSum := mean * rmdParams$NZ_HHs]
qDT[, scaledlCI_lo := ci_lower_HEEP2 * rmdParams$NZ_HHs]
qDT[, scaleduCI_hi := ci_upper_HEEP2 * rmdParams$NZ_HHs]

t <- qDT[, .(q, GWh = scaledSum/1000000, 
                        GWh_lower = scaledlCI_lo/1000000, 
                        GWh_upper = scaleduCI_hi/1000000)]
t <- t[, CI := GWh_upper - GWh_lower]

ft <- flextable(t[!is.na(q), .(q, sampleN = 280, popN = rmdParams$NZ_HHs, 
                               GWh, GWh_lower, GWh_upper, CI)])
ft <- colformat_num(ft, digits = 2)
ft <- colformat_num(ft, j = c("sampleN","popN"), digits = 0, na_str = "N/A")

set_caption(ft, 
            caption = "Simulated mean total daily household GWh by quarter (n = 280, scaled, 2015)")
Table 3.11: Simulated mean total daily household GWh by quarter (n = 280, scaled, 2015)

q

sampleN

popN

GWh

GWh_lower

GWh_upper

CI

Q1

280

1,721,000

28.50

26.92

30.09

3.16

Q2

280

1,721,000

42.00

39.40

44.59

5.19

Q3

280

1,721,000

48.81

45.90

51.72

5.82

Q4

280

1,721,000

32.58

30.66

34.49

3.83

3.2.2 Proportional scaling

In this example we adjust the daily kWh to account for the distribution of the number of dwelling occupants (1,2,3,4+). To do this we need to know the proportion of these households in both the Green Grid sample and the closest (2013) NZ census. Note that the total number of households in 2013 was 1,549,875 rather than the 2015 estimate of 1,796,000 as used above.

Table 3.12 shows the distribution of households of different sizes in the Green grid 2015 sample compared to the 2013 Census data. This table clearly shows that the Green grid sample under-represents single person households and over-represents 3-4+ person households. Given that larger households tend to use more electricity (see mean kWh column) adjusting for these over-representations may reduce the bias in the sample mean and thus in a scaled population estimate.

# GG data
occDT <- dailyAllDT[hhDT][!is.na(nPeople_Factor) & 
                   !is.na(season), .(nHHs_GG = uniqueN(linkID),
                     meanDailykWh = mean(sumkWh),
                     sd = sd(sumkWh)), keyby = .(nPeople = nPeople_Factor)]
occDT[, pc_HHS_GG := 100*(nHHs_GG/sum(nHHs_GG))]

# census data
c_occDT <- censusOccupancyDT[Year == 2013, .(nHHs_census = sum(Value)), keyby = .(nPeople = nPeople_Factor)]
c_occDT[, pc_HHS_census := 100*(nHHs_census/sum(nHHs_census))]

setkey(occDT, nPeople)
setkey(c_occDT, nPeople)

occDT <- occDT[c_occDT]

ft <- set_caption(flextable(occDT[,.(nPeople, nHHs_GG, pc_HHS_GG, meanDailykWh, nHHs_census, pc_HHS_census)]),
            caption = "Mean daily kWh by household size and distribution of household sizes (Green Grid 2015, Census 2013)")
ft <- colformat_num(ft, digits = 2)
colformat_num(ft, j = c("nHHs_GG", "nHHs_census"), digits = 0, na_str = "N/A")
Table 3.12: Mean daily kWh by household size and distribution of household sizes (Green Grid 2015, Census 2013)

nPeople

nHHs_GG

pc_HHS_GG

meanDailykWh

nHHs_census

pc_HHS_census

1

3

8.57

8.45

355,284

22.92

2

10

28.57

22.28

527,676

34.05

3

7

20.00

23.53

254,862

16.44

4+

15

42.86

23.30

412,071

26.59

Table 3.13 shows the mean daily kWh by household size together with their 95% confidence intervals confidence intervals. We can see that these are extremely wide due to the small counts within each occupancy category. Indeed the 95% CI for the single person households spans 0… As above, we then simulate the 95% CI using the HEEP2 full sample size of 280 to show how the uncertainty would reduce with a larger sample size.

occDT[, ci_lower := meanDailykWh - (sd/sqrt(nHHs_GG))*qnorm(0.975)]
occDT[, ci_upper := meanDailykWh + (sd/sqrt(nHHs_GG))*qnorm(0.975)]

# distribute the HEEP2 280 according to census %
occDT[, nHEEP2 := 280 * pc_HHS_census/100]

occDT[, ci_lowerHEEP2 := meanDailykWh - (sd/sqrt(nHEEP2))*qnorm(0.975)]
occDT[, ci_upperHEEP2 := meanDailykWh + (sd/sqrt(nHEEP2))*qnorm(0.975)]

ft <- set_caption(flextable(occDT[, .(nPeople, nHHs_GG, meanDailykWh,
                                      sd, ci_lower, ci_upper,
                                      ci_lowerHEEP2, ci_upperHEEP2)]),
            caption = "Mean daily kWh by household size (Green Grid, 2015)")

ft <- colformat_num(ft, digits = 2)
colformat_num(ft, j = c("nHHs_GG"), digits = 0, na_str = "N/A")
Table 3.13: Mean daily kWh by household size (Green Grid, 2015)

nPeople

nHHs_GG

meanDailykWh

sd

ci_lower

ci_upper

ci_lowerHEEP2

ci_upperHEEP2

1

3

8.45

23.18

-17.79

34.68

2.77

14.12

2

10

22.28

10.95

15.50

29.07

20.09

24.48

3

7

23.53

15.64

11.94

35.11

19.01

28.04

4+

15

23.30

12.79

16.83

29.78

20.40

26.21

Repeating the multiplicative scaling process now requires us to multiple the mean daily values by the number of households in each occupancy category.

occDT[, dailyGWh := (nHHs_census * meanDailykWh)/1000000]
occDT[, scaledggCI_lo := (ci_lower * nHHs_census)/1000000]
occDT[, scaledggCI_hi := (ci_upper * nHHs_census)/1000000]
occDT[, scaledHEEP2CI_lo := (ci_lowerHEEP2 * nHHs_census)/1000000]
occDT[, scaledHEEP2CI_hi := (ci_upperHEEP2 * nHHs_census)/1000000]

occDT[, pc := 100*(dailyGWh/sum(occDT$dailyGWh))]

ft <- set_caption(flextable(occDT[, .(nPeople, nHHs_GG, nHHs_census, dailyGWh,
                                      "% of total" = pc,
                                      scaledggCI_lo, scaledggCI_hi,
                                      nHHs_HEEP2_full = nHEEP2, scaledHEEP2CI_lo,scaledHEEP2CI_hi)]),
            caption = "Scaled total daily GWh by household size (2013 Census)")
ft <- colformat_num(ft, big.mark=",")

colformat_num(ft, j = c("nHHs_GG", "nHHs_census", "nHHs_HEEP2_full"), digits = 0)
Table 3.14: Scaled total daily GWh by household size (2013 Census)

nPeople

nHHs_GG

nHHs_census

dailyGWh

% of total

scaledggCI_lo

scaledggCI_hi

nHHs_HEEP2_full

scaledHEEP2CI_lo

scaledHEEP2CI_hi

1

3

355,284

3.00

9.88

-6.32

12.32

64

0.99

5.02

2

10

527,676

11.76

38.73

8.18

15.34

95

10.60

12.92

3

7

254,862

6.00

19.75

3.04

8.95

46

4.84

7.15

4+

15

412,071

9.60

31.63

6.93

12.27

74

8.40

10.80

Aggregating the total GWh values across all household types after scaling in this way gives a mean daily residential GWh value of 30.4 GWh, compared to the 34.45 GWh MBIE value. This is likely to be caused by the use of the Census 2013 household counts (total households = 1,549,893) which are ~ 10% lower than both the ICP counts in the 2015 MBIE data (1,720,690) and the projected 2015 household counts (1,721,000) used in Section 3.2.1.

3.2.3 Multi-variate re-weighting

Weights created using IPF method and NZ Census 2013 area unit data as part of SPATIALEC project. Aggregate these to give ‘grossing’ weights for each Green Grid household. Then create a fractional weight so they sum to 1. Use weighted survey stats to produce estimates.

ggIPF_DT <- ggIPF_AU_DT[, .(grossWeight = sum(ipfWeight)), keyby = .(linkID)]

# now we need a fractional weight so sum to 1 across linkIDs
ggIPF_DT[, fracWeight := grossWeight/sum(ggIPF_DT$grossWeight)]

# link them
surveyData_DT <- ggIPF_DT[hhDT]

# % sum error

pcError <- 100*(1-sum(ggIPF_DT$grossWeight)/rmdParams$NZ_nHHs_Census_2013)

The grossing weights should sum to a close estimate of the number of Census households:

  • sum of grossing weights: 1
  • N Census 2013 households: 1549893

Difference: 488.971 or 0.03%

With such a small sample we will have a number of dwellings with identical weights. We will also have a few households with large weights.

summary(surveyData_DT$fracWeight)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## 0.003024 0.013480 0.013480 0.023810 0.024044 0.154906        4
dt <- ggIPF_DT[, .(n = .N), keyby = .(fracWeight = round(fracWeight,4), grossWeight = round(grossWeight, 4))]
ft <- flextable(dt)
set_caption(ft, caption = "Number of Green Grid households by fractional weight")
Table 3.15: Number of Green Grid households by fractional weight

fracWeight

grossWeight

n

0.0030

4684.929

2

0.0071

10975.493

6

0.0097

14998.792

1

0.0135

20886.696

19

0.0240

37254.174

6

0.0353

54636.578

1

0.0381

59059.202

1

0.0562

87095.568

2

0.0584

90514.934

2

0.0838

129880.847

1

0.1549

240012.522

1

sum(dt$fracWeight*dt$n)
## [1] 1.0001
sum(dt$grossWeight*dt$n)
## [1] 1549404
sum(dt$n)
## [1] 42
# let's plot the fractional weights just to illustrate the problem
ggplot2::ggplot(ggIPF_DT, aes(x = fracWeight)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

why do we get se = 0 for 1 person households? check weights :-)

setkey(dailyDT, linkID)
setkey(ggIPF_DT, linkID)

dailyWtDT <- dailyDT[ggIPF_DT] # will drop the ones with no weight

dailyWtDT <- dailyWtDT[!is.na(sumkWh) & !is.na(grossWeight)]

svy_dailyFracWt <- survey::svydesign(id = ~linkID, weights = ~fracWeight,
                           data = dailyWtDT)
# Fractional weights
sm <- svymean(~sumkWh, svy_dailyFracWt)
sm
##          mean     SE
## sumkWh 22.062 2.1003
confint(sm)
##           2.5 %   97.5 %
## sumkWh 17.94505 26.17798
smNP <- svyby(~sumkWh, by = ~nPeople_Factor, svy_dailyFracWt, svymean, vartype=c("se","ci"))
smNP
##    nPeople_Factor   sumkWh           se     ci_l     ci_u
## 1               1 16.90225 1.790235e-15 16.90225 16.90225
## 2               2 20.32065 2.686547e+00 15.05511 25.58618
## 3               3 25.56045 3.502871e+00 18.69495 32.42595
## 4+             4+ 23.77514 2.540849e+00 18.79517 28.75511
# why do we get se = 0 for 1 person households? check weights :-)
ft <- flextable(smNP)

set_caption(colformat_num(ft, digits = 2),
            caption = "Mean daily kWh by size of household")
Table 3.16: Mean daily kWh by size of household

nPeople_Factor

sumkWh

se

ci_l

ci_u

1

16.90

0.00

16.90

16.90

2

20.32

2.69

15.06

25.59

3

25.56

3.50

18.69

32.43

4+

23.78

2.54

18.80

28.76

smNP_DT <- as.data.table(smNP)
smNP_DT[, nPeople := nPeople_Factor]
setkey(smNP_DT, nPeople)
smNP_DT <- smNP_DT[c_occDT]
smNP_DT[, meanDailyGWh := (sumkWh * nHHs_census)/1000000]
smNP_DT[, ci_lower_GWh := (ci_l *nHHs_census)/1000000]
smNP_DT[, ci_upper_GWh := (ci_u *nHHs_census)/1000000]

sum(smNP_DT$meanDailyGWh)
## [1] 33.03925
ft <- flextable(smNP_DT[,.(nPeople_Factor, meanDailyGWh,
                                    ci_lower_GWh, ci_upper_GWh)])
set_caption(colformat_num(ft, digits = 2),
            caption = "Mean daily GWh by size of household")
Table 3.16: Mean daily GWh by size of household

nPeople_Factor

meanDailyGWh

ci_lower_GWh

ci_upper_GWh

1

6.01

6.01

6.01

2

10.72

7.94

13.50

3

6.51

4.76

8.26

4+

9.80

7.74

11.85

Try the gross weight - the mean should be the same.

svy_dailyGrossWt <- survey::svydesign(id = ~linkID, weights = ~grossWeight,
                           data = dailyWtDT)

sm_gw <- svymean(~sumkWh, svy_dailyGrossWt)
confint(sm_gw)
##           2.5 %   97.5 %
## sumkWh 17.94505 26.17798
message("Can we use svytotal? Only if there are no missing days")
## Can we use svytotal? Only if there are no missing days
nDays <- dailyDT[, .(nDays = .N), keyby = linkID]
summary(nDays$nDays)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   162.0   279.0   231.6   349.2   365.0
message("There are missing days if the mean is not 365")
## There are missing days if the mean is not 365

Identify key attributes of the ‘groups’ with very large weights.

temp_wdt <- ggIPF_DT[hhDT]

temp_wdt[, gWtFactor := as.factor(round(grossWeight, 0))]

t <- with(temp_wdt, table(gWtFactor, nPeople))
kableExtra::kable(t, caption = "N people by grossWeight") %>% kable_styling()
Table 3.17: N people by grossWeight
1 2 3 4 5 6
4685 0 2 0 0 0 0
10975 0 0 6 0 0 0
14999 0 0 0 1 0 0
20887 0 0 0 13 5 1
37254 0 6 0 0 0 0
54637 0 1 0 0 0 0
59059 0 0 1 0 0 0
87096 2 0 0 0 0 0
90515 2 0 0 0 0 0
129881 0 0 1 0 0 0
240013 0 1 0 0 0 0
t <- with(temp_wdt, table(gWtFactor, nAdults))
kableExtra::kable(t, caption = "N adults by grossWeight") %>% kable_styling()
Table 3.17: N adults by grossWeight
1 2 3
4685 2 0 0
10975 1 5 0
14999 0 1 0
20887 0 16 3
37254 0 6 0
54637 0 1 0
59059 0 1 0
87096 2 0 0
90515 2 0 0
129881 0 0 1
240013 0 1 0
t <- with(temp_wdt, table(gWtFactor, hasPV, useNA = "always"))
kableExtra::kable(t, caption = "N have PV by grossWeight") %>% kable_styling()
Table 3.17: N have PV by grossWeight
No yes NA
4685 2 0 0
10975 6 0 0
14999 1 0 0
20887 17 2 0
37254 6 0 0
54637 1 0 0
59059 1 0 0
87096 0 2 0
90515 2 0 0
129881 1 0 0
240013 1 0 0
NA 2 0 2
t <- with(temp_wdt, table(gWtFactor, `Heat pump number`, useNA = "always"))
kableExtra::kable(t, caption = "N heat pumps by grossWeight") %>% kable_styling()
Table 3.17: N heat pumps by grossWeight
1 3 NA
4685 1 0 1
10975 4 0 2
14999 0 0 1
20887 14 1 4
37254 2 0 4
54637 0 0 1
59059 0 0 1
87096 0 1 1
90515 0 0 2
129881 1 0 0
240013 0 0 1
NA 1 0 3
t <- with(temp_wdt, table(gWtFactor, Q7labAgg, useNA = "always"))
kableExtra::kable(t, caption = "Age of dwelling by grossWeight") %>% kable_styling()
Table 3.17: Age of dwelling by grossWeight
< 1999 >= 2000 NA
4685 2 0 0
10975 4 1 1
14999 0 1 0
20887 14 5 0
37254 3 2 1
54637 1 0 0
59059 1 0 0
87096 1 1 0
90515 1 1 0
129881 0 1 0
240013 1 0 0
NA 1 1 2

3.3 Model-based estimates

Develop a simple daily kWh model

hist(dailyDT$sumkWh)

# skewed (of course)
dailyWinterAggDT <- dailyDT[q == "Q2" | q == "Q3", .(meanDailykWh = mean(sumkWh, na.rm = TRUE)), keyby = linkID]
hist(dailyWinterAggDT$meanDailykWh)

# still skewed (of course)

# test with a qq plot
dailyWinterAggDT[, logkWh := log(meanDailykWh)]
qqnorm(dailyWinterAggDT$logkWh); qqline(dailyWinterAggDT$logkWh, col = 2)

shapiro.test(dailyWinterAggDT$logkWh)
## 
##  Shapiro-Wilk normality test
## 
## data:  dailyWinterAggDT$logkWh
## W = 0.93879, p-value = 0.06918
# if p > 0.05 => normal 
# is it? Beware: shapiro-wilks is less robust as N -> 

setkey(dailyWinterAggDT,linkID)
modelDT <- dailyWinterAggDT[hhDT[, .(linkID, nPeople_Factor, Q7labAgg, Location, Q20_coded)]]
skimr::skim(modelDT)
Table 3.18: Data summary
Name modelDT
Number of rows 46
Number of columns 7
_______________________
Column type frequency:
character 5
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
linkID 0 1.00 5 6 0 46 0
nPeople_Factor 4 0.91 1 2 0 4 0
Q7labAgg 4 0.91 6 7 0 2 0
Location 2 0.96 8 10 0 2 0
Q20_coded 2 0.96 0 31 2 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
meanDailykWh 14 0.7 26.30 10.94 13.98 18.26 21.04 32.63 53.37 ▇▂▃▂▁
logkWh 14 0.7 3.19 0.39 2.64 2.90 3.05 3.49 3.98 ▇▇▅▅▂
modelDT[, hasElecHeat := ifelse(Q20_coded == "Heat pump" | 
                                  Q20_coded == "HRV or other ventilation system" |
                                  Q20_coded == "Portable electric heaters", "Yes", "No")]

m <- lm(logkWh ~ Location  + Q7labAgg +
           hasElecHeat + nPeople_Factor, data = modelDT)

summary(m)
## 
## Call:
## lm(formula = logkWh ~ Location + Q7labAgg + hasElecHeat + nPeople_Factor, 
##     data = modelDT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6157 -0.3327  0.0000  0.2972  0.7529 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.03093    0.52565   5.766 8.43e-06 ***
## LocationTaranaki -0.07109    0.21252  -0.334    0.741    
## Q7labAgg>= 2000  -0.08899    0.20134  -0.442    0.663    
## hasElecHeatYes    0.02884    0.21268   0.136    0.893    
## nPeople_Factor2   0.19670    0.51403   0.383    0.706    
## nPeople_Factor3   0.35621    0.54700   0.651    0.522    
## nPeople_Factor4+  0.19336    0.53829   0.359    0.723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4394 on 22 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.06954,    Adjusted R-squared:  -0.1842 
## F-statistic: 0.274 on 6 and 22 DF,  p-value: 0.9432
set_caption(colformat_num(flextable(broom::tidy(m)), digits = 3),
            caption = "Model predicting log(mean daily kWh) 2015")
Table 3.18: Model predicting log(mean daily kWh) 2015

term

estimate

std.error

statistic

p.value

(Intercept)

3.031

0.526

5.766

0.000

LocationTaranaki

-0.071

0.213

-0.334

0.741

Q7labAgg>= 2000

-0.089

0.201

-0.442

0.663

hasElecHeatYes

0.029

0.213

0.136

0.893

nPeople_Factor2

0.197

0.514

0.383

0.706

nPeople_Factor3

0.356

0.547

0.651

0.522

nPeople_Factor4+

0.193

0.538

0.359

0.723

Not a great model: -ve adjusted r sq is a strong indicator of this

Test the model - diagnostics

message("# Diagnostic plots")
## # Diagnostic plots
plot(m)
## Warning: not plotting observations with leverage one:
##   6

message("# Normality of residuals")
## # Normality of residuals
hist(m$residuals)

qqnorm(m$residuals); qqline(m$residuals, col = 2)

shapiro.test(m$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  m$residuals
## W = 0.94209, p-value = 0.1137
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
message("# autocorrelation/independence of errors")
## # autocorrelation/independence of errors
car::durbinWatsonTest(m)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.09877985      2.105069    0.99
##  Alternative hypothesis: rho != 0
# if p < 0.05 then a problem as implies autocorrelation
message("# homoskedasticity: formal test")
## # homoskedasticity: formal test
car::ncvTest(m)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.10365, Df = 1, p = 0.29347
# if p > 0.05 then there is heteroskedasticity

message("# -> collinearity")
## # -> collinearity
car::vif(m)
##                    GVIF Df GVIF^(1/(2*Df))
## Location       1.678126  1        1.295425
## Q7labAgg       1.303368  1        1.141651
## hasElecHeat    1.454321  1        1.205952
## nPeople_Factor 2.149870  3        1.136062
# if any values > 10 -> problem
message("# -> tolerance")
## # -> tolerance
1/car::vif(m)
##                     GVIF        Df GVIF^(1/(2*Df))
## Location       0.5959028 1.0000000       0.7719474
## Q7labAgg       0.7672431 1.0000000       0.8759241
## hasElecHeat    0.6876061 1.0000000       0.8292202
## nPeople_Factor 0.4651444 0.3333333       0.8802337
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem

#save the residuals via broom
library(broom)
resids <- broom::augment(m)
# plot fitted vs residuals
ggplot(resids, aes(x = .fitted, y = .std.resid)) + 
    geom_point(size = 1)
## Warning: Removed 1 rows containing missing values (geom_point).

message("Where there any NAs where the model failed?")
## Where there any NAs where the model failed?
summary(resids$.std.resid)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -1.497679 -0.918058 -0.075575  0.005251  0.799830  1.947064         1

Overall, not a very good model. But it serves the purpose…

4 R environment

To aid reproducibility:

R.version
##                _                           
## platform       x86_64-apple-darwin17.0     
## arch           x86_64                      
## os             darwin17.0                  
## system         x86_64, darwin17.0          
## status                                     
## major          4                           
## minor          0.2                         
## year           2020                        
## month          06                          
## day            22                          
## svn rev        78730                       
## language       R                           
## version.string R version 4.0.2 (2020-06-22)
## nickname       Taking Off Again

Additional CRAN packages used:

# this will print out the packages we loaded earlier and set up bibtex references
# it requires a bib file to exist (see yaml) and the references to be in this format

for(p in cranPackages){
  cat(" * ", p, " [@",p,"]\n", sep = "")
}
  • data.table (Dowle et al. 2015)
  • flextable (Gohel 2020)
  • ggplot2 (Wickham 2009)
  • here (Müller 2017)
  • lubridate (Grolemund and Wickham 2011)
  • pwr (Champely 2018)
  • hms (Müller 2018)
  • kableExtra (Zhu 2018)
  • readxl (Wickham and Bryan 2019)
  • survey (Lumley 2020)

Other packages used:

# this will print out the packages we loaded earlier and set up bibtex references
# it requires a bib file to exist (see yaml) and the references to be in this format

for(p in githubPackages){
  cat(" * ", p, " [@",p,"]\n", sep = "")
}
  • GREENGridData (Anderson and Eyers 2018)
  • dkUtils (Anderson 2018)

References

Anderson, Ben. 2018. DkUtils: A Bunch of Useful Little Functions. https://github.com/dataknut/dkUtils.

Anderson, Ben, and David Eyers. 2018. GREENGridData: Processing Nz Green Grid Project Data to Create a ’Safe’ Version for Data Archiving and Re-Use. https://github.com/CfSOtago/GREENGridData.

Anderson, Ben, David Eyers, Rebecca Ford, Diana Giraldo Ocampo, Rana Peniamina, Janet Stephenson, Kiti Suomalainen, Lara Wilcocks, and Michael Jack. 2018. “New Zealand GREEN Grid Household Electricity Demand Study 2014-2018,” September. https://doi.org/10.5255/UKDA-SN-853334.

Anderson, Ben, Tom Rushby, Abubakr Bahaj, and Patrick James. 2020. “Ensuring Statistics Have Power: Guidance for Designing, Reporting and Acting on Electricity Demand Reduction and Behaviour Change Programs.” Energy Research & Social Science 59 (January): 101260. https://doi.org/10.1016/j.erss.2019.101260.

Champely, Stephane. 2018. Pwr: Basic Functions for Power Analysis. https://CRAN.R-project.org/package=pwr.

Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.

Gohel, David. 2020. Flextable: Functions for Tabular Reporting. https://CRAN.R-project.org/package=flextable.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

Lumley, Thomas. 2020. “Survey: Analysis of Complex Survey Samples.”

Müller, Kirill. 2017. Here: A Simpler Way to Find Your Files. https://CRAN.R-project.org/package=here.

———. 2018. Hms: Pretty Time of Day. https://CRAN.R-project.org/package=hms.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, and Jennifer Bryan. 2019. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.

Zhu, Hao. 2018. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.